<?xml version="1.0" encoding="utf-8"?>
<!--
                                                                                     
 h       t     t                ::       /     /                     t             / 
 h       t     t                ::      //    //                     t            // 
 h     ttttt ttttt ppppp sssss         //    //  y   y       sssss ttttt         //  
 hhhh    t     t   p   p s            //    //   y   y       s       t          //   
 h  hh   t     t   ppppp sssss       //    //    yyyyy       sssss   t         //    
 h   h   t     t   p         s  ::   /     /         y  ..       s   t    ..   /     
 h   h   t     t   p     sssss  ::   /     /     yyyyy  ..   sssss   t    ..   /     
                                                                                     
	<https://y.st./>
	Copyright © 2016 Alex Yst <mailto:copyright@y.st>

	This program is free software: you can redistribute it and/or modify
	it under the terms of the GNU General Public License as published by
	the Free Software Foundation, either version 3 of the License, or
	(at your option) any later version.

	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
	GNU General Public License for more details.

	You should have received a copy of the GNU General Public License
	along with this program. If not, see <https://www.gnu.org./licenses/>.
-->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
	<head>
		<base href="https://y.st./en/weblog/2016/01-January/08.xhtml" />
		<title>An error in the handling of relative URIs &lt;https://y.st./en/weblog/2016/01-January/08.xhtml&gt;</title>
		<link rel="icon" type="image/png" href="/link/CC_BY-SA_4.0/y.st./icon.png" />
		<link rel="stylesheet" type="text/css" href="/link/basic.css" />
		<link rel="stylesheet" type="text/css" href="/link/site-specific.css" />
		<script type="text/javascript" src="/script/javascript.js" />
		<meta name="viewport" content="width=device-width" />
	</head>
	<body>
		<nav>
			<p>
				<a href="/en/">Home</a> |
				<a href="/en/a/about.xhtml">About</a> |
				<a href="/en/a/contact.xhtml">Contact</a> |
				<a href="/a/canary.txt">Canary</a> |
				<a href="/en/URI_research/"><abbr title="Uniform Resource Identifier">URI</abbr> research</a> |
				<a href="/en/opinion/">Opinions</a> |
				<a href="/en/coursework/">Coursework</a> |
				<a href="/en/law/">Law</a> |
				<a href="/en/a/links.xhtml">Links</a> |
				<a href="/en/weblog/2016/01-January/08.xhtml.asc">{this page}.asc</a>
			</p>
			<hr/>
			<p>
				Weblog index:
				<a href="/en/weblog/"><abbr title="American Standard Code for Information Interchange">ASCII</abbr> calendars</a> |
				<a href="/en/weblog/index_ol_ascending.xhtml">Ascending list</a> |
				<a href="/en/weblog/index_ol_descending.xhtml">Descending list</a>
			</p>
			<hr/>
			<p>
				Jump to entry:
				<a href="/en/weblog/2015/03-March/07.xhtml">&lt;&lt;First</a>
				<a rel="prev" href="/en/weblog/2016/01-January/07.xhtml">&lt;Previous</a>
				<a rel="next" href="/en/weblog/2016/01-January/09.xhtml">Next&gt;</a>
				<a href="/en/weblog/latest.xhtml">Latest&gt;&gt;</a>
			</p>
			<hr/>
		</nav>
		<header>
			<h1>An error in the handling of relative <abbr title="Uniform Resource Identifier">URI</abbr>s</h1>
			<p>Day 00307: Friday, 2016 January 08</p>
		</header>
<p>
	I awoke this morning to find that the spider had choked on a bad <abbr title="Uniform Resource Identifier">URI</abbr> that it had pulled from its own database.
	It assumes that all <abbr title="Uniform Resource Identifier">URI</abbr>s in its database are valid, but somehow, it had managed to put an invalid <abbr title="Uniform Resource Identifier">URI</abbr> there.
	Before attempting to diagnose the problem, I made sure to check my email to see if the school had written me back though.
	They had not.
	With that out of the way, I ran a query against the database to find the offensive page so that I could run tests on my <code>merge_uris()</code> function, which had to have returned invalid results for this to happen.
	The link was to <code>https://5jp7xtmox6jyoqd5.onion</code>, so I just searched for the page that linked to it.
	Only one result came up: <a href="http://52wdeibt3ivmcapq.onion/darknet.html"><code>http://52wdeibt3ivmcapq.onion/darknet.html</code></a>.
	This page contains many technically-invalid <abbr title="Uniform Resource Identifier">URI</abbr>s that my spider should have successfully sanitized.
	Or rather, my <code>merge_uris()</code> function should have successfully sanitized them.
	Running another query against the database, I found that this page alone had been allowed to add eight <abbr title="Uniform Resource Identifier">URI</abbr>s with no path component to my database; <abbr title="Uniform Resource Identifier">URI</abbr>s that are technically invalid and which, by feature, not bug, would choke <code>merge_uris()</code> when used as the &quot;absolute <abbr title="Uniform Resource Identifier">URI</abbr>&quot; parameter of that function.
	But before reaching the database, they should have been given as the &quot;relative <abbr title="Uniform Resource Identifier">URI</abbr>&quot; parameter, causing the function to return the <abbr title="Uniform Resource Identifier">URI</abbr> with a slash at the end, making the <abbr title="Uniform Resource Identifier">URI</abbr>s valid.
	To make sure that it was in fact a bug in the function and not the spider, I tested the function on every invalid <abbr title="Uniform Resource Identifier">URI</abbr> that the page had added to my database.
	The function returned the expected incorrect result every time.
</p>
<p>
	I added a call to <code>\var_dump()</code> right before the return statement, but it was not getting executed.
	Searching for another spot where the function returns, I immediately found the problem.
	If the relative <abbr title="Uniform Resource Identifier">URI</abbr> has a different scheme that the absolute <abbr title="Uniform Resource Identifier">URI</abbr> that it is merged with, the relative <abbr title="Uniform Resource Identifier">URI</abbr> is assumed to be absolute.
	In theory, this should be accurate.
	If the scheme is different, relative <abbr title="Uniform Resource Identifier">URI</abbr>s should not be used at all, so the hyperlink should point to an absolute <abbr title="Uniform Resource Identifier">URI</abbr>.
	In practice though, some webmasters do not respect this, or even do not know it.
	It does not help that many Web browsers stupidly hide the path of a <abbr title="Uniform Resource Identifier">URI</abbr> when the path is only a slash.
	All this does is breed ignorance that the <abbr title="Uniform Resource Identifier">URI</abbr> even has the trailing slash, resulting in more people writing bad hyperlinks, such as those on the page that messed up my database.
	That said, my function should account for bad input as well, at least if the bad input is said to be a relative <abbr title="Uniform Resource Identifier">URI</abbr>.
	This function&apos;s whole purpose is to take incomplete <abbr title="Uniform Resource Identifier">URI</abbr>s and make them whole in an automated way.
	I think I managed to fix the function, then I used a short script to repair the damage to the database so I would not have to throw out the whole database.
</p>
<p>
	In the process of scanning the database for errors, I found several erroneous <abbr title="Internet Relay Chat">IRC</abbr> <abbr title="Uniform Resource Identifier">URI</abbr>s, which got me thinking.
	At some point, I should scan the database and see how many onion-based <abbr title="Internet Relay Chat">IRC</abbr> networks I can find.
</p>
<p>
	Upon the next run of the spider, I found that it was requesting <abbr title="File Transfer Protocol">FTP</abbr> and Gopher pages.
	I thought that I had implemented a protocol whilelist and only allowed <abbr title="Hypertext Transfer Protocol Secure">HTTPS</abbr>- and <abbr title="Hypertext Transfer Protocol">HTTP</abbr>-based <abbr title="Uniform Resource Identifier">URI</abbr>s, but it seems that I neglected to make the spider actually check the whitelist.
	That means that the spider will be requesting pages that it does not know how to handle yet.
	However, because of the new logic flow that allows use of the MySQL database, if I fix the protocol whitelist feature, the spider will look endlessly.
</p>
<p>
	Having needed to update my base library, I put aside work on the spider, despite its current issues, to work on a wrapper class, as I had agreed to include at least new wrapper class in every update.
	I have also decided to restructure the library version numbers a bit.
	Currently, the version numbers are increasing quickly, making it look like I am making more progress on it than I am.
	I will also be holding the version numbers back a bit until progress catches up with the version number.
	After building the <a href="https://secure.php.net/manual/en/ref.fdf.php">FDF</a> wrapper class, I started work on a <a href="https://secure.php.net/manual/en/ref.ftp.php"><abbr title="File Transfer Protocol">FTP</abbr></a> wrapper class.
	I found though that some <abbr title="File Transfer Protocol">FTP</abbr> functions rely on file resources, so I would need to complete a <a href="https://secure.php.net/manual/en/function.fopen.php">file</a> wrapper class first.
	However, for this wrapper class, I would need a wrapper class for stream resources.
	Looking into stream resources, I found that there is a documented prototype for implementing <a href="https://secure.php.net/manual/en/class.streamwrapper.php">stream objects</a>.
	My understanding of this class prototype though is that it does not meet my goals.
	Instead of wrapping up a stream resource with its related functions, it instead replaces stream resources altogether.
	I might try creating a class that both implements this prototype and wraps up the functions I want wrapped up.
	This class prototype though, and probably stream resources in general, requires stream context support, so I moved on and built the a class wrapping stream context resources.
</p>
<p>
	For my own reference, current work that needs to be done on the spider includes:
</p>
<ul>
	<li>find a way to avoid trying to crawl uncrawlable <abbr title="Uniform Resource Identifier">URI</abbr>s without creating an endless loop</li>
	<li>replace mis-implemented protocol whitelist with a <code>switch()</code> statement to allow handling different protocols in different ways</li>
	<li>restructure program flow to cause <code>&lt;a/&gt;</code>s to be scanned for and saved before the <code>&lt;title/&gt;</code> is recorded to make spider interruptions to no longer be detrimental</li>
	<li>Fix handling of <code>&lt;a/&gt;</code>s that have child nodes</li>
</ul>
<p>
	My <a href="/a/canary.txt">canary</a> still sings the tune of freedom and transparency.
</p>
		<hr/>
		<p>
			Copyright © 2016 Alex Yst;
			You may modify and/or redistribute this document under the terms of the <a rel="license" href="/license/gpl-3.0-standalone.xhtml"><abbr title="GNU&apos;s Not Unix">GNU</abbr> <abbr title="General Public License version Three or later">GPLv3+</abbr></a>.
			If for some reason you would prefer to modify and/or distribute this document under other free copyleft terms, please ask me via email.
			My address is in the source comments near the top of this document.
			This license also applies to embedded content such as images.
			For more information on that, see <a href="/en/a/licensing.xhtml">licensing</a>.
		</p>
		<p>
			<abbr title="World Wide Web Consortium">W3C</abbr> standards are important.
			This document conforms to the <a href="https://validator.w3.org./nu/?doc=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F08.xhtml"><abbr title="Extensible Hypertext Markup Language">XHTML</abbr> 5.1</a> specification and uses style sheets that conform to the <a href="http://jigsaw.w3.org./css-validator/validator?uri=https%3A%2F%2Fy.st.%2Fen%2Fweblog%2F2016%2F01-January%2F08.xhtml"><abbr title="Cascading Style Sheets">CSS</abbr>3</a> specification.
		</p>
	</body>
</html>

